The Habits of Goodreads Users

An Exploration of Ratings, Young Adult Novels, Tags, and More
Data Science 1 with R (STAT 301-1)

Author

Valerie Chu

Published

December 6, 2023

Introduction

What is Goodreads?

This is how Goodreads describes itself: “Goodreads is the world’s largest site for readers and book recommendations. Our mission is to help readers discover books they love and get more out of reading. Goodreads launched in January 2007.”

Among other unique features, Goodreads allows users to:

  • Rate books

  • Write book reviews

  • Track and tag the books they’re reading, have read, and want to read

Data Overview and Quality

What is my data about?

This dataset contains six million ratings for the 10,000 most rated books on Goodreads.

It was last updated on Sept. 19, 2017, so books published after that date won’t appear in this dataset.

It comes in five separate CSV files: “books”, “books_tags”, “ratings”, “tags”, and “tbr”.

(For context, users on Goodreads can tag books and add them to their shelves. And “tbr” stands for “to be read”.)

In this document, when I say “Goodreads data” or “Goodreads dataset”, I am referring to the five datasets generally. For individual datasets, I will use their specific names.

(All data is in the “data” folder. I didn’t have to clean any of the five original datasets, so the folder holds no cleaned copies of them.)
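Here is a minimal sketch of how the five files could be loaded into one named list with base R. The “.csv” file names and the `load_goodreads()` helper are my own assumptions for illustration, and the demo writes tiny placeholder files to a temporary folder so it runs without the real data.

```r
# Dataset names as listed in this report; the ".csv" file names are an assumption.
dataset_names <- c("books", "books_tags", "ratings", "tags", "tbr")

# Hypothetical helper: read every dataset in `data_dir` into a named list.
load_goodreads <- function(data_dir) {
  paths <- file.path(data_dir, paste0(dataset_names, ".csv"))
  names(paths) <- dataset_names
  lapply(paths, read.csv, stringsAsFactors = FALSE)
}

# Demo with tiny placeholder files so the sketch runs anywhere.
demo_dir <- file.path(tempdir(), "goodreads_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (name in dataset_names) {
  write.csv(data.frame(id = 1:2),
            file.path(demo_dir, paste0(name, ".csv")),
            row.names = FALSE)
}

goodreads <- load_goodreads(demo_dir)
names(goodreads)
```

With the real files, the call would presumably be `load_goodreads("data")`.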

Why am I interested in this dataset?

The reason I’m interested in this dataset is extremely simple: I love reading. Analyzing Goodreads data and Goodreads user data seems super fun.

More specifically, I’m interested in analyzing user habits on Goodreads, which is why I chose this dataset. After several hours of combing through Kaggle, Google, Reddit, and other places on the internet, this is the most comprehensive freely available dataset about books and user reading habits that I could find. It was scraped from publicly available Goodreads book and user data.

More about the data

The Goodreads dataset is of high quality.

There are no missing values in any of the five datasets except “books”, which is missing some values in language_code, isbn, isbn13, original_title, and original_publication_year. This will not affect my exploratory data analysis, although I will keep the missingness in mind. (See “01_Exploring_Missingness.R” for more.)
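A missingness check like this one can be sketched in a few lines of base R. The data frame below is a made-up stand-in for “books”, not the real file, and this is not necessarily what “01_Exploring_Missingness.R” does.

```r
# Toy stand-in for "books"; the values are invented for illustration.
books <- data.frame(
  book_id        = 1:4,
  title          = c("A", "B", "C", "D"),
  language_code  = c("eng", NA, "en-US", NA),
  isbn           = c("0001", "0002", NA, "0004"),
  average_rating = c(4.1, 3.9, 4.4, 3.6),
  stringsAsFactors = FALSE
)

# Count NA values per column, then keep only the columns that have any.
# Note: read.csv() leaves blank text fields as "" unless na.strings includes "".
missing_counts <- colSums(is.na(books))
missing_counts[missing_counts > 0]
```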

The “books” dataset:

  • There are 10,000 observations.

  • There are 23 variables.

  • There are 17 numerical variables.

  • There are 6 categorical variables.
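Counts like these can be reproduced with a quick check of dimensions and column types. This is only a sketch on a made-up four-column data frame; with the real “books” file, which columns count as numerical depends on how each column is read in (isbn, for example, could be parsed as a number or as text).

```r
# Toy stand-in for "books"; the values are invented for illustration.
books <- data.frame(
  book_id        = 1:4,
  average_rating = c(4.1, 3.9, 4.4, 3.6),
  title          = c("A", "B", "C", "D"),
  language_code  = c("eng", "eng", "en-US", "spa"),
  stringsAsFactors = FALSE
)

n_obs  <- nrow(books)   # number of observations
n_vars <- ncol(books)   # number of variables

# Flag numeric columns (integer or double); everything else is treated as categorical.
is_num        <- vapply(books, is.numeric, logical(1))
n_numeric     <- sum(is_num)
n_categorical <- sum(!is_num)
```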

The other four datasets can be combined with each other and with the “books” dataset in ways that can enhance my exploration of Goodreads data and user habits.

My Objectives

There are several questions I’m interested in examining in Goodreads data.

Some of these questions include:

  • What are the most highly rated books?

  • Do readers who leave the most ratings leave higher or lower ratings on average?

  • Is there any relationship between the number of times a book appears in the “tbr” dataset and the number of ratings it received?

  • What is the relationship between a book’s average rating and the year it was published?

  • How do users tag the most highly rated books? Is there a trend?

I have divided my exploration of the Goodreads dataset into two parts. The first part looks at the rating habits of Goodreads users. The second part focuses on how Goodreads users interact with young adult (YA) novels on Goodreads.

Joining Data

Note

The joining step took a lot of exploration because parts of the README that came with this dataset were unclear. I made a qmd in this project called “00_Joining_Datasets” that walks through my process of joining these datasets in more depth.

Step 1: Joining tags and books_tags

Why am I joining tags and books_tags?

When I join the “books_tags” dataset together with the “tags” dataset, I can figure out what each book (goodreads_book_id) was tagged with (tag_name). (I would still have to match each goodreads_book_id with its title, but that’s another join.)
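As a rough sketch of this join in base R: I’m assuming here that both datasets share a tag_id key column, which this report does not state, and the rows below are made up.

```r
# Toy stand-ins; tag_id as the shared key is an assumption.
books_tags <- data.frame(
  goodreads_book_id = c(1, 1, 2),
  tag_id            = c(10, 11, 10),
  count             = c(50, 20, 5)
)
tags <- data.frame(
  tag_id   = c(10, 11, 12),
  tag_name = c("fantasy", "young-adult", "classics"),
  stringsAsFactors = FALSE
)

# Left join: keep every (book, tag) pair and attach the readable tag name.
books_and_tags <- merge(books_tags, tags, by = "tag_id", all.x = TRUE)
books_and_tags
```

The `all.x = TRUE` choice keeps book-tag rows even if a tag_id has no match in “tags”, so unmatched rows show up as NA in tag_name instead of silently disappearing.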

Step 2: Joining books and ratings

Why am I joining books and ratings?

When I join the “books” dataset with the “ratings” dataset, I can see the rating each user gave each book.

Joining “books” and “ratings” will allow me to start answering three of my research questions: “What are the most highly rated books?”, “What is the relationship between a book’s average rating and the year it was published?”, and “Do readers who leave the most ratings leave higher or lower ratings on average?”
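This join can be sketched the same way. I’m assuming a shared book_id key and a rating column in “ratings”; neither column name is confirmed in this report, and the rows are made up.

```r
# Toy stand-ins; book_id as the shared key and the rating column are assumptions.
books <- data.frame(
  book_id        = 1:3,
  title          = c("Book A", "Book B", "Book C"),
  average_rating = c(4.0, 4.0, 4.0),
  stringsAsFactors = FALSE
)
ratings <- data.frame(
  user_id = c(1, 1, 2, 3),
  book_id = c(1, 2, 1, 3),
  rating  = c(5, 4, 3, 4)
)

# One row per (user, book) rating, with the book's details attached.
books_and_ratings <- merge(ratings, books, by = "book_id")
books_and_ratings
```

A quick sanity check after a join like this is that the result has exactly as many rows as “ratings”; more rows would point to duplicate keys in “books”.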

Now, for the fun part. An EDA with our two new datasets, “books_and_tags” and “books_and_ratings”. And maybe some of the original datasets too.

Exploration 1: The Rating Habits of Goodreads Users

Part 1: Which 10 books have the highest average Goodreads rating?

To answer this question, I’m going to look at the books dataset. The variable average_rating can help us figure this out.
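A minimal sketch of the ranking step, using a made-up 12-book data frame in place of the real “books” dataset:

```r
# Toy stand-in for "books"; the ratings are invented for illustration.
books <- data.frame(
  title = paste("Book", 1:12),
  average_rating = c(4.10, 3.95, 4.82, 4.30, 3.60, 4.77,
                     4.01, 4.55, 3.20, 4.65, 4.21, 3.88),
  stringsAsFactors = FALSE
)

# Sort by average_rating from highest to lowest and keep the first 10 rows.
ranked <- books[order(books$average_rating, decreasing = TRUE), ]
top_10 <- head(ranked[, c("title", "average_rating")], 10)
top_10
```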

10 Books with the Highest Average Goodreads Rating

Table 1: 10 Books with the Highest Average Goodreads Rating

I’m surprised at some of the ratings of these books, yet very much not surprised at others. These are some things we can see from Table 1:

  • “The Complete Calvin and Hobbes” is more geared toward middle-grade readers, so it’s odd that it has the highest rating.

  • But I’m not surprised that “Words of Radiance” had the second highest average Goodreads rating. I know that the Goodreads demographic tends to be young adults and book bloggers, most of whom enjoy reading and rating young adult fiction on Goodreads. That’s a category this book falls within.

  • I definitely expected a Harry Potter book to be one of the highest-rated books on Goodreads, but not necessarily a boxed set. (Although I should note, the problem with Goodreads is that it considers boxed sets individual books, and there are neither string functions nor tags that I can use to filter them out, since the naming conventions vary from boxed set to boxed set and closely resemble how individual book titles appear.) But just from being a casual Goodreads user, I’ve observed that boxed sets tend to get higher ratings than the individual books in a series. So in a way, it is unsurprising that the trend holds true here.

  • The “ESV Study Bible” is also something I expected to see on this list, although I expected it to be ranked higher. The Bible is an important book for many people.

  • I am surprised that various spinoffs of Calvin and Hobbes and Harry Potter dominate 7 of 10 rankings in the top 10 most highly rated books. I knew they were popular, but not that popular. But I guess Goodreads is also an American company, and those are the types of books Americans read and love.

Part 2: Do people hate-rate or love-rate books?

In other words, is there a relationship between a book’s number of ratings and its average rating?

There appears to be no relationship between a book’s average rating and its rank based on the number of ratings it received.

Looking at the correlation in Table 2 confirms this: The correlation between rank_of_number_ratings and average_rating is close to 0 (about -0.09).

Table 2: The correlation between a book’s rank and its average rating
                        average_rating  rank_of_number_ratings
average_rating               1.0000000              -0.0863433
rank_of_number_ratings      -0.0863433               1.0000000

I’m very surprised at this finding. I had expected the rank of the number of ratings a book received to have at least some correlation with the average rating.

So, it seems people are neither more nor less inclined to rate a book based on whether they hated or loved that book.
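A correlation table like Table 2 can be sketched as follows. I’m assuming that rank_of_number_ratings is built by ranking a ratings_count column so that rank 1 is the most-rated book; both the column name and the ranking direction are assumptions, and the numbers are made up.

```r
# Toy stand-in for "books"; ratings_count is an assumed column name.
books <- data.frame(
  average_rating = c(4.2, 3.8, 4.5, 3.9, 4.0),
  ratings_count  = c(5000, 12000, 800, 3000, 20000)
)

# Rank 1 = the book with the most ratings (assumed direction).
books$rank_of_number_ratings <- rank(-books$ratings_count, ties.method = "min")

# Pearson correlation matrix for the two variables.
cor_table <- cor(books[, c("average_rating", "rank_of_number_ratings")])
cor_table
```

Pearson correlation only picks up linear relationships, so a value near 0 rules out a straight-line trend rather than every possible pattern. Passing `method = "spearman"` to `cor()` is one way to check for a monotonic trend instead.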

Part 3: How many books and what percent of books on Goodreads are written in English?

This dataset covers the 10,000 most rated books on Goodreads. The graph above is so green because most of these books are in some variety of English. Unsurprisingly, the people who use Goodreads, an American company, mostly rate books written in English and other European languages.

So, how many books and what percent of books on Goodreads are written in English?

Table 3: How many books and what percent of books on Goodreads are written in English?

Table 3 reveals both some surprising and some unsurprising data:

  • The green-ness of the graph above already told us to expect that most books on Goodreads would be written in some variety of English. However, I was surprised at just how large that share is: 63.41% of books have the language code “eng”, while 20.7% were written in American English and 2.57% in British English.

  • I did not expect that “Other” would make up 10.84% of the 10,000 most rated books on Goodreads. That’s a lot of other languages.

  • I was also surprised that Arabic was the 5th most popular language on Goodreads, after several varieties of English and “Other”. I instead expected European languages to dominate the Top 10 most popular languages on here since Goodreads is an American company, but Arabic made the list.
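A count-and-percent table like Table 3 can be sketched as below. This report does not say how “Other” was built, so the sketch assumes it lumps every code outside the top few, and it keeps missing codes in their own “Missing” group so the two are not confused. The codes below are made up.

```r
# Toy language codes; the real column is language_code in "books".
language_code <- c(rep("eng", 12), rep("en-US", 4), rep("en-GB", 2), "ara", NA)

# The three most common non-missing codes (table() drops NA by default).
top_codes <- names(sort(table(language_code), decreasing = TRUE))[1:3]

# Assumed grouping: top codes keep their name, the rest become "Other".
language_group <- ifelse(
  is.na(language_code), "Missing",
  ifelse(language_code %in% top_codes, language_code, "Other")
)

counts <- sort(table(language_group), decreasing = TRUE)
language_table <- data.frame(
  language = names(counts),
  n_books  = as.vector(counts),
  percent  = round(100 * as.vector(counts) / length(language_group), 2),
  stringsAsFactors = FALSE
)
language_table
```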

Part 4: What’s the distribution of book ratings?

When I look at a Goodreads rating, I don’t think of the rating scale as continuous. I think about ratings in bins of 0.25.

For example, a book with a rating ≥ 4.5 is excellent. A book with a rating ≥ 4.75 is practically unheard of. And a book with a rating between 4.0 and 4.25 is great.

So that’s why instead of using a histogram, I’m going to put average ratings into bins of 0.25 and create a bar plot that will display the distribution of book ratings on Goodreads in a way that’s intuitive to think about.
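The binning step can be sketched with base R’s `cut()`. The ratings below are made up, and the break points from 2.25 to 5 are my assumption, chosen to span the range of ratings in this sketch.

```r
# Toy average ratings, invented for illustration.
average_rating <- c(2.47, 3.10, 3.75, 3.76, 4.00, 4.01, 4.25, 4.60, 4.80)

# Bin edges every 0.25. Intervals are right-closed, e.g. (3.75,4],
# so a rating of exactly 4.00 lands in (3.75,4], not (4,4.25].
breaks <- seq(2.25, 5, by = 0.25)
rating_bin <- cut(average_rating, breaks = breaks)

# Count books per bin. A rating at or below the lowest break would become NA.
bin_counts <- as.data.frame(table(rating_bin))
bin_counts
```

Because the bins are a factor, empty bins still show up with a count of 0, which keeps the bar plot’s x-axis evenly spaced.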

Distribution of Book Ratings

Table 4: Distribution of Book Ratings
Average Rating    Number of Books
(2.25,2.5]                      1
(2.5,2.75]                      1
(2.75,3]                       12
(3,3.25]                       66
(3.25,3.5]                    275
(3.5,3.75]                   1189
(3.75,4]                     3269
(4,4.25]                     3695
(4.25,4.5]                   1363
(4.5,4.75]                    124
(4.75,5]                        5

A few things I found interesting about the distribution of the average rating of books on Goodreads:

  • It’s left skewed and unimodal.

  • The largest group of books (3695 of them, as seen in Table 4) has a rating between 4 and 4.25.

  • Another 3269 books (see Table 4) have a rating between 3.75 and 4.

  • 1 book has an average rating between 2.25 and 2.5. 1 book has a rating between 2.5 and 2.75.

  • 5 books have an average rating between 4.75 and 5.0.

So clearly, there are very very few books in the 10,000 most rated books on Goodreads with an average rating of less than 3. And there are very very few books with an average rating of more than 4.75.

That suggests a few possible explanations for how people rate the popular books on Goodreads:

    1. They liked the book enough to go to Goodreads and rate it decently highly (a 3, 4, or 5).
    2. They don’t like to give extremely high or low ratings.
    3. The many people who give a book a decently high rating pull its average up, outweighing the few who rate it low.

Without more data, it’s hard to tell whether we can explain away the cluster around book ratings between 3.75 and 4.25 as one of these suggestions, a combination of these suggestions, or none of these suggestions. It’s still fun to think about though.

Also, I should note again that this dataset contains only the most rated (i.e., popular) books on Goodreads. In that context, it makes sense that books with lower ratings likely aren’t promoted enough, and therefore likely aren’t rated enough, to appear in this dataset.

Part 5: Do readers who leave the most ratings leave higher or lower ratings on average?

Part 5, Section 1: Who are the readers who leave the most ratings on Goodreads?

Each user (user_id) can give each book only one rating. There are 53,424 users who rated the 10,000 most rated books on Goodreads. These are the top 10 raters.

Top 10 Goodreads Users Who Rated the Most Books in this Dataset

Table 5: The 10 Goodreads Users Who Rated the Most Books in this Dataset
user_id count
12874 200
30944 200
12381 199
28158 199
52036 199
6630 197
45554 197
7563 196
9668 196
9806 196

Table 5 shows that the users who rated the most books in this dataset all rated around 200 books. If we also keep in mind that this dataset only has the data of the 10,000 most rated books on Goodreads, having users who rated 200 of these books is quite impressive.
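A table like Table 5 can be sketched by counting rows per user_id. The “ratings” rows below are made up, and the book_id and rating column names are assumptions.

```r
# Toy stand-in for "ratings"; book_id and rating are assumed column names.
ratings <- data.frame(
  user_id = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4),
  book_id = c(1, 2, 3, 1, 2, 1, 1, 2, 3, 4),
  rating  = c(5, 4, 3, 4, 2, 5, 3, 4, 4, 5)
)

# Check the one-rating-per-user-per-book rule: 0 means no duplicate pairs.
n_duplicates <- anyDuplicated(ratings[, c("user_id", "book_id")])

# Number of distinct users, and how many ratings each one left.
n_users <- length(unique(ratings$user_id))
user_counts <- as.data.frame(table(user_id = ratings$user_id),
                             responseName = "count",
                             stringsAsFactors = FALSE)

# Keep the 10 users with the most ratings (this toy data only has 4 users).
top_raters <- head(user_counts[order(user_counts$count, decreasing = TRUE), ], 10)
top_raters
```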

At the same time, I did expect this number to be higher, especially since I know the people who tend to use Goodreads are bookworms who can read dozens of books per year. But maybe since Goodreads was launched in January 2007 and did not gain popularity until more people had access to technology, this makes sense.

Part 5, Section 2: How do top raters tend to rate books?

Table 6: The correlation between a user’s rank and their average rating
                     count          id  average_rating
count            1.0000000  -0.0634094      -0.0825340
id              -0.0634094   1.0000000      -0.0150214
average_rating  -0.0825340  -0.0150214       1.0000000

I’m not surprised that Table 6 shows there’s no correlation between how many books someone rated and whether they had a higher or lower average rating.

Table 2 looked at whether there was a correlation between a book’s average rating and its rank based on the number of ratings it received.

In retrospect, it makes sense that if there’s no relationship between number of ratings and average rating, top Goodreads users wouldn’t differ from the average trend.
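The per-user comparison behind Table 6 can be sketched like this: compute each user’s number of ratings and mean rating, then correlate the two. The rows are made up, the rating column name is an assumption, and the id column from Table 6 is left out because this report does not describe how it was built.

```r
# Toy stand-in for "ratings"; rating is an assumed column name.
ratings <- data.frame(
  user_id = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4),
  rating  = c(5, 4, 3, 4, 2, 5, 3, 4, 4, 5)
)

# Per-user number of ratings and mean rating.
count          <- tapply(ratings$rating, ratings$user_id, length)
average_rating <- tapply(ratings$rating, ratings$user_id, mean)

user_stats <- data.frame(
  user_id        = names(count),
  count          = as.vector(count),
  average_rating = as.vector(average_rating),
  stringsAsFactors = FALSE
)

# Pearson correlation between how much a user rates and how highly they rate.
cor_users <- cor(user_stats$count, user_stats$average_rating)
cor_users
```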

Exploration 2: Young Adult Novels on Goodreads

Part 1: What are the most common tags?

Table 7: What are the most common tags?

Part 2: Which are the highest rated young adult (YA) books?

100 Most Highly Rated YA Books

Table 8: 100 Most Highly Rated YA Books

Part 3: What years were young adult (YA) books published in?

Part 4: Which YA books have more ratings?

Part 5: Do YA books published in certain years tend to be more highly rated than other YA books?

Conclusion

References

Zając, Zygmunt (2017, Sept. 19). goodbooks-10k. GitHub. https://github.com/zygmuntz/goodbooks-10k